SYSTEM AND METHOD FOR BLENDING SYNTHETIC VOICES
BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0001] The present invention relates to synthetic voices and more specifically to a 
system and method of blending several different synthetic voices to obtain a new
synthetic voice having at least one of the characteristics of the different voices. 

2. Introduction 

[0002] Text-to-speech (TTS) systems typically offer the user a choice of synthetic voices 
from a relatively small number of voices. For example, many systems allow users to 
select a male or female voice to interact with. When a user desires a voice having a
particular feature, such as a particular accent, the user must select a voice that inherently
has that characteristic. This approach presents challenges for a user who may desire a
voice having characteristics that are not available. The number of TTS voices is limited
because each voice is costly and time-consuming to generate. As a result, few voices are
available, and fewer still have any specific characteristic.
[0003] Given the small number of choices available to the average user when selecting a 
synthetic voice, there is a need in the art for more flexibility to enable a user to obtain a 
synthetic voice having the desired characteristics. What is further needed in the art is a
system and method of obtaining a desired synthetic voice utilizing existing synthetic 
voices. 

SUMMARY OF THE INVENTION 

[0004] Additional features and advantages of the invention will be set forth in the 
description which follows, and in part will be obvious from the description, or may be 
learned by practice of the invention. The features and advantages of the invention may 
be realized and obtained by means of the instruments and combinations particularly 
pointed out in the appended claims. These and other features of the present invention 
will become more fully apparent from the following description and appended claims, or
may be learned by the practice of the invention as set forth herein. 
[0005] In its broadest terms, the present invention comprises a system and method of 
blending at least a first synthetic voice with a second synthetic voice to generate a new 
synthetic voice having characteristics of the first and second synthetic voices. The 
system may comprise a computer server or other computing device storing software 
operating to control the device to present the user with options to manipulate and 
receive synthetic voices comprising a blending of a first synthetic voice and a second 
synthetic voice. 

BRIEF DESCRIPTION OF THE DRAWINGS 

[0006] In order to describe the manner in which the above-recited and other advantages 
and features of the invention can be obtained, a more particular description of the 
invention briefly described above will be rendered by reference to specific embodiments 
thereof which are illustrated in the appended drawings. Understanding that these 
drawings depict only typical embodiments of the invention and are not therefore to be 
considered to be limiting of its scope, the invention will be described and explained with 
additional specificity and detail through the use of the accompanying drawings in which: 
[0007] FIG. 1 illustrates a webpage presenting a user with various synthetic voice 
options for selecting the characteristics of a synthetic voice; 

[0008] FIG. 2 illustrates a block diagram of the system aspect of the present invention; 
[0009] FIG. 3A shows an exemplary method according to an aspect of the present 
invention; and 

[0010] FIG. 3B shows another exemplary method according to another aspect of the 
invention. 
DETAILED DESCRIPTION OF THE INVENTION

[0011] The system and method of the present invention provide a user with a greater 
range of choice of synthetic voices than may otherwise be available. The use of synthetic 
voices is increasing in many aspects of human-computer interaction. For example, 
AT&T's VoiceTone service provides a natural language interface for a user to obtain
information about the user's telephone account and services. Rather than navigating
through a complicated touch-tone menu system, the user can simply speak and articulate 
what he or she desires. The service then responds with the information via a natural 
language dialog. The text-to-speech (TTS) component of the dialog includes a synthetic 
voice that the user hears. The present invention provides means for enabling a user to 
receive a larger selection of synthetic voices to suit the user's desires. 
[0012] FIG. 1 illustrates a simple example of a graphical user interface such as a web 
browser where the user has the option in the context of a TTS webpage 100 to select 
from a plurality of different voices and voice characteristics. Shown are a few samplings 
of potential choices. Under the voice selection section 102 the user can select from a 
male voice or a female voice. The emotion selection section 104 presents the user with 
options to select from a happy, sad or normal emotional state for the voice. An accent 
selection section 106 presents the user with accents such as French, German or a New 
York accent for the synthetic voice. 

[0013] FIG. 2 illustrates the general architecture of the invention. A synthetic voice 
server 206 provides the necessary software to present the user at a client device 202 or 
204 with options of synthetic voices from which to choose. The communication link
208 between the client devices 202, 204 and the server 206 may be the World Wide Web, a
wireless communication link or other type of communication. The server 206 communicates
with a database 210 that stores synthetic voice data for use by the server 206 to generate 
a synthetic voice. Those of ordinary skill in the art will understand the basic 
programming necessary to generate a synthetic TTS voice for use in a natural language
dialog with a user. See, e.g., Huang, Acero and Hon, Spoken Language Processing,
Prentice Hall PTR, 2001, Chapters 14-16. Therefore, the basic details of such a system 
are not provided herein. 

[0014] It is appreciated that the location of TTS software, the location of TTS voice 
data, and the location of client devices are not relevant to the present invention. The 
basic functionality of the invention is not dependent on any specific network or network 
configuration. Accordingly, the system of FIG. 2 is only presented as a basic example of 
a system that may relate to the present invention. 

[0015] FIG. 3A shows an example method according to an aspect of the invention. The 
method comprises presenting the user with at least two TTS voices (302). This step, for 
example, may occur in the server-client model where the server presents the user via a 
web browser or other means with a selection of TTS voices. At least two voices are 
presented to the user in this aspect of the invention. The method comprises receiving 
the user selection of at least two TTS voices (304) and presenting the user with at least 
one characteristic of each selected TTS voice (306). There are a number of 
characteristics that may be selected but examples include accent and pitch. The system 
presents the user with a new blended TTS voice (308) that reflects a blend of the 
characteristics of the two voices. For example, if the user selected a male voice and a 
German voice along with an accent characteristic, the new blended voice could be a male 
voice with a German accent. The new blended voice would be a composite or blending 
of the two previously existing TTS voices. 

[0016] FIG. 3A further presents the user with options to adjust the new blended voice
(310). If the user adjusts the blended voice, then the method receives the adjustments
from the user (312) and the method returns to step (308) to present again the adjusted
blended voice to the user. If there are no user adjustments in step (310) then the method
comprises presenting the user with a final blended voice for selection. 
[0017] FIG. 3B provides another aspect of the method of the present invention. The 
method in this aspect comprises presenting the user with at least one TTS voice and a 
TTS voice characteristic (320). The system receives a user selection of a TTS voice and 
the user-selected voice characteristic (322). The system presents the user with a new 
blended TTS voice comprising the selected TTS voice blended with at least one other 
TTS voice to achieve the selected voice characteristic (324). In this regard, the TTS 
voice characteristic is matched with a stored TTS voice to enable the blending of the 
presented TTS voice and a second TTS voice associated with the selected characteristic. 
[0018] An example of this new blended voice may be if the user selects a male voice and 
a German accent as the characteristic. The new blended voice may comprise a blending 
of the basic TTS male voice with one or more existing TTS voices to generate the male, 
German accent voice. The method then comprises presenting the user with options to 
make any user-selected adjustments (326). If adjustments are received (328), the method 
comprises making the adjustments and presenting a new blended TTS voice to the user 
for review (324). If no adjustments are received, then the method comprises presenting a
final blended voice to the user for selection (330). 
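
By way of illustration only, and not limitation, the matching of a user-selected
characteristic to a stored TTS voice (324) may be sketched in Python as follows. The
catalog contents, the names StoredVoice and find_voice_with, and the characteristic
labels are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoredVoice:
    """A stored TTS voice tagged with its characteristics (hypothetical schema)."""
    name: str
    characteristics: Dict[str, str]


def find_voice_with(catalog: List[StoredVoice], characteristic: str,
                    value: str, exclude: str) -> Optional[StoredVoice]:
    """Return the first stored voice, other than `exclude`, whose
    characteristic matches the user-selected value, or None if no voice matches."""
    for voice in catalog:
        if voice.name != exclude and voice.characteristics.get(characteristic) == value:
            return voice
    return None


# Hypothetical catalog: the user picked "male_basic" and asked for a German accent.
catalog = [
    StoredVoice("male_basic", {"gender": "male", "accent": "us_english"}),
    StoredVoice("female_german", {"gender": "female", "accent": "german"}),
]
partner = find_voice_with(catalog, "accent", "german", exclude="male_basic")
missing = find_voice_with(catalog, "accent", "french", exclude="male_basic")
```

In this sketch the returned voice would serve as the second TTS voice to be blended
with the user-selected voice; a result of None indicates that no stored voice carries
the requested characteristic.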

[0019] The above descriptions of the basic steps according to the various aspects of the 
invention may be further expanded upon. For example, when the user selects a voice
characteristic, this may involve selecting a characteristic or parameter as well as a value of 
the parameter in a voice. In this regard, the user may select differing values of 
parameters for a new blended voice. Examples include a range of values for accent, 
pitch, friendliness, hipness, and so on. The accent may be a blend of U.K. English and 
U.S. English. Providing a sliding range of values of a parameter enables the user to 
create a preferred voice in an almost unlimited number of ways. As another example, if 
the parameter range for each characteristic is a range of 0 (no presence of the
characteristic) to 10 (full presentation of this characteristic in the blended voice), the user 
could select U.K. English at a value of say 6, and U.S. English at a value of 3, and a 
friendliness value of 9, and so on to create their voice. Thus, the new blended voice will 
be a weighted average of existing TTS voices according to user-selected parameters and 
characteristics. As can be appreciated, in a database of TTS voices, each voice will be 
characterized and categorized according to its parameters for selection in the blending 
process. 
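
As a minimal sketch of the weighted average described above, the following Python
example converts user-selected values on the 0 to 10 range into blending weights and
applies them to one numeric parameter. Normalizing the values so that the weights sum
to one is an assumption made for illustration, as are the function names and the
per-voice pitch values.

```python
from typing import Dict


def normalize_weights(settings: Dict[str, float], max_value: float = 10.0) -> Dict[str, float]:
    """Turn 0..max_value user settings into weights that sum to 1 (assumed scheme)."""
    for name, value in settings.items():
        if not 0.0 <= value <= max_value:
            raise ValueError(f"{name}: value {value} is outside 0..{max_value}")
    total = sum(settings.values())
    if total == 0:
        raise ValueError("at least one source voice needs a non-zero value")
    return {name: value / total for name, value in settings.items()}


def blend_parameter(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of one numeric parameter across the source voices."""
    return sum(weights[name] * values[name] for name in weights)


# U.K. English at a value of 6 and U.S. English at a value of 3, as in the text.
accent_weights = normalize_weights({"uk_english": 6, "us_english": 3})

# Hypothetical mean pitch (Hz) of each source voice.
mean_pitch = blend_parameter({"uk_english": 120.0, "us_english": 150.0}, accent_weights)
```

With these settings the U.K. English voice contributes two thirds of the blend and the
U.S. English voice one third.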

[0020] Some of the characteristics of voices are discussed next. Accent, the "locality" of
a voice, is determined by the accent of the source voice(s). For best results, an
interpolated voice in U.S. English is constructed only from U.S. English source voices.
Some attributes of any accent, such as accent-specific pronunciations, are carried by the
TTS front-end in, for example, pronunciation dictionaries. Pitch is determined by a
Pitch Prediction module within the TTS system that contributes desired pitch values to a
symbolic query string for a unit selection module. The basic concept of unit selection is 
well known in the art. To synthesize speech, small units of speech are selected and 
concatenated together and further processed to sound natural. The unit selection 
module manages this process to select the best stored units of sound (each of which may
be a phoneme, a diphone, etc., up to and including an entire sentence).
[0021] The speech segments delivered by the unit selection module are then pitch 
modified in the TTS back-end. One example method of performing a pitch modification 
is to apply pitch synchronous overlap and add (PSOLA). The pitch prediction model 
parameters are trained using recordings from the source voices. These model parameters
can then be interpolated with weights to create the pitch model parameters for the
interpolated voice. Emotions, such as happiness, sadness, anger, etc., are primarily driven
by using emotionally marked sections of the recorded voice databases. Certain aspects,
such as emotion-specific pitch ranges, are set by emotional category and/or user input.
[0022] Given fixed categories of accent and emotion, speech database units of different 
speakers in the same category can be blended in a number of different ways. One way is 
the following: 

(a) Parameterizing the speech segments into segment parameters (for 
example, in terms of Linear-Predictive Coding (LPC) spectral envelopes); 

(b) Interpolating between corresponding speech segmental parameters of 
different speakers employing weights provided by the user; and 

(c) Using the interpolated parameters to re-synthesize speech for the 
interpolated voice. 
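
Steps (a) through (c) may be sketched in Python as follows, by way of example only.
The sketch assumes that each segment is parameterized as LPC reflection coefficients
(one of several equivalent LPC representations) obtained with the Levinson-Durbin
recursion, and that a unit impulse stands in for the excitation; handling of the LPC
residual is omitted. The toy frames and all function names are hypothetical.

```python
import random
from typing import List


def autocorrelation(frame: List[float], order: int) -> List[float]:
    """Biased autocorrelation of one frame for lags 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n)) for lag in range(order + 1)]


def reflection_coefficients(frame: List[float], order: int) -> List[float]:
    """Step (a): Levinson-Durbin recursion, returning the reflection coefficients."""
    r = autocorrelation(frame, order)
    a, err, ks = [1.0], r[0], []
    for i in range(1, order + 1):
        k = -sum(a[j] * r[i - j] for j in range(i)) / err
        prev = a + [0.0]
        a = [prev[j] + k * prev[i - j] for j in range(i + 1)]
        err *= 1.0 - k * k
        ks.append(k)
    return ks


def interpolate(params_a: List[float], params_b: List[float], weight_b: float) -> List[float]:
    """Step (b): weighted interpolation of corresponding segment parameters."""
    return [(1.0 - weight_b) * x + weight_b * y for x, y in zip(params_a, params_b)]


def predictor_from_reflection(ks: List[float]) -> List[float]:
    """Step-up recursion: reflection coefficients to A(z) coefficients, a[0] == 1."""
    a = [1.0]
    for i, k in enumerate(ks, start=1):
        prev = a + [0.0]
        a = [prev[j] + k * prev[i - j] for j in range(i + 1)]
    return a


def synthesize(a: List[float], excitation: List[float]) -> List[float]:
    """Step (c): run the excitation through the all-pole filter 1/A(z)."""
    out: List[float] = []
    for n, e in enumerate(excitation):
        out.append(e - sum(a[j] * out[n - j] for j in range(1, len(a)) if n - j >= 0))
    return out


def toy_frame(seed: int, pole: float, length: int = 160) -> List[float]:
    """Stand-in for a recorded speech segment: noise shaped by a one-pole filter."""
    rng, prev, frame = random.Random(seed), 0.0, []
    for _ in range(length):
        prev = pole * prev + rng.gauss(0.0, 1.0)
        frame.append(prev)
    return frame


ORDER = 4
ks_a = reflection_coefficients(toy_frame(1, 0.9), ORDER)   # speaker A
ks_b = reflection_coefficients(toy_frame(2, -0.5), ORDER)  # speaker B
ks_blend = interpolate(ks_a, ks_b, weight_b=0.3)           # user-provided weight
blended_frame = synthesize(predictor_from_reflection(ks_blend), [1.0] + [0.0] * 159)
```

Reflection coefficients are chosen here because a weighted average of two sets whose
magnitudes are all below one also has magnitudes below one, so the interpolated
synthesis filter remains stable.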

[0023] The best results when practicing the invention occur when all the speakers in a 
given category record the same text corpus. Further, for best results, the individual speech
units that are interpolated should come from the same utterances, for example, /ae/ from
the word "cat" in the sentence "The cat crossed the road", uttered by all the source
speakers using the same emotional setting, such as "happy."

[0024] A variety of speech parameters may be utilized when blending the voices. For 
example, equivalent parameters include, but are not limited to, line spectral frequencies, 
reflection coefficients, log-area ratios, and autocorrelation coefficients. When LPC 
parameters are interpolated, the corresponding data associated with the LPC residuals 
needs to be interpolated also. The Line Spectral Frequency (LSF) representation is the most
widely accepted representation of LPC parameters for quantization, since it possesses a
number of advantageous properties including filter stability preservation. The residual
interpolation can be done, for example, by splitting the LPC residual into harmonic and
noise components, estimating speaker-specific distributions for individual harmonic
amplitudes, as well as for the noise components, and interpolating between them. Each
of these parameters is a frame-based parameter, roughly meaning that it describes a
short time frame of around 20 ms or less.
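
The stability-preserving property of the LSF representation may be illustrated with
the following minimal Python sketch. It relies on the known property that an LSF
vector corresponds to a stable synthesis filter when its values are strictly
increasing between 0 and pi radians, and that a weighted average of two such vectors
keeps that ordering. The LSF values and function names are hypothetical.

```python
import math
from typing import List


def interpolate_lsf(lsf_a: List[float], lsf_b: List[float], weight_b: float) -> List[float]:
    """Weighted average of two LSF vectors of the same order (radians)."""
    if len(lsf_a) != len(lsf_b):
        raise ValueError("LSF vectors must have the same order")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie in 0..1")
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(lsf_a, lsf_b)]


def is_valid_lsf(lsf: List[float]) -> bool:
    """True when the values are strictly increasing and lie strictly inside (0, pi)."""
    inside = all(0.0 < f < math.pi for f in lsf)
    ordered = all(x < y for x, y in zip(lsf, lsf[1:]))
    return inside and ordered


# Hypothetical 6th-order LSF vectors for the same frame from two speakers.
lsf_speaker_a = [0.3, 0.7, 1.2, 1.9, 2.5, 2.9]
lsf_speaker_b = [0.4, 0.9, 1.1, 1.7, 2.6, 3.0]
lsf_blend = interpolate_lsf(lsf_speaker_a, lsf_speaker_b, weight_b=0.5)
blend_is_stable = is_valid_lsf(lsf_blend)
```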

[0025] Other parameters may also be utilized for blending voices. In addition to the 
frame-based parameters discussed above, phoneme-based, diphone-based, triphone- 
based, demisyllable-based, syllable-based, word-based, phrase-based and general or 
sentence-based parameters may be employed. These parameters illustrate different 
features. The frame-based parameters exhibit a short term spectrum, the phone-based 
parameters characterize vowel color, the syllable-based parameters illustrate stress timing 
and the general or sentence-based parameters illustrate mood or emotion. 
[0026] Other parameters may include prosodic aspects to capture the specifics of how a 
person is saying a particular utterance. Prosody is a complex interaction of physical, 
phonetic effects that is employed to express attitude, assumptions, and attention as a 
parallel channel in speech communication. For example, prosody communicates a 
speaker's attitude towards the message, towards the listener, and to the communication 
event. Pauses, pitch, rate and relative duration and loudness are the main components of 
prosody. While prosody may carry important information that is related to a specific 
language being spoken, as it is in Mandarin Chinese, prosody can also have personal 
components that identify a particular speaker's manner of communicating. Given the 
amount of information within prosodic parameters, an aspect of the present invention is 
to utilize prosodic parameters in voice blending. For example, low-level voice prosodic 
attributes that may be blended include pitch contour, spectral envelope (LSF, LPC), 
volume contour and phone durations. Other higher-level parameters used for blending 
voices may include syllable and language accents, stress, emotion, etc. 
[0027] One method of blending these segment parameters is to extract the parameter
from the residual signal associated with each voice, interpolate between the extracted
parameters and combine the residuals to obtain a representation of a new segment
parameter representing the combination of the voices. For example, a system can extract
the pitch as a prosodic parameter from each of two TTS voices and interpolate between
the two pitches to generate a blended pitch.
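
The pitch example may be sketched in Python as follows. The sketch assumes that the
two pitch contours have already been extracted and time-aligned so that they contain
the same number of frames; the contour values, the weight and the function name are
hypothetical.

```python
from typing import List


def blend_pitch_contours(contour_a: List[float], contour_b: List[float],
                         weight_b: float) -> List[float]:
    """Frame-by-frame linear interpolation between two pitch contours (Hz)."""
    if len(contour_a) != len(contour_b):
        raise ValueError("contours must be time-aligned to the same length")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie in 0..1")
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(contour_a, contour_b)]


# Hypothetical pitch contours for the same utterance from two TTS voices.
voice_one_pitch = [110.0, 115.0, 120.0, 118.0]
voice_two_pitch = [210.0, 220.0, 230.0, 225.0]

# A weight of 0.25 keeps the blend closer to the first voice.
blended_pitch = blend_pitch_contours(voice_one_pitch, voice_two_pitch, weight_b=0.25)
```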

[0028] Yet further parameters that may be utilized include speaker-specific
pronunciations. These may be more correctly termed "mis-pronunciations" in that each
person deviates from the standard pronunciation of words in a specific way. These
deviations relate to a specific person's speech pattern and can act like a speech
fingerprint to identify the person. An example of voice blending using speaker-specific
pronunciations would be a response to a user's request for a voice that sounded like their 
voice with Arnold Schwarzenegger's accent. In this regard, the specific mis- 
pronunciations of Arnold Schwarzenegger would be blended with the user's voice to 
provide a blended voice having both characteristics. 

[0029] One example method for organizing this information is to establish a voice 
profile which is a database of all speaker-specific parameters for all time scales. This 
voice profile is then used for voice selection and blending purposes. The voice profile 
organizes the various parameters for a specific voice that can be utilized for blending one 
or more of the voice characteristics. 

[0030] Embodiments within the scope of the present invention may also include 
computer-readable media for carrying or having computer-executable instructions or data 
structures stored thereon. Such computer-readable media can be any available media that 
can be accessed by a general purpose or special purpose computer. By way of example, 
and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, 
CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage 
devices, or any other medium which can be used to carry or store desired program code 
means in the form of computer-executable instructions or data structures. When 
information is transferred or provided over a network or another communications 
connection (either hardwired, wireless, or a combination thereof) to a computer, the
computer properly views the connection as a computer-readable medium. Thus, any 
such connection is properly termed a computer-readable medium. Combinations of the 
above should also be included within the scope of the computer-readable media. 
[0031] Computer-executable instructions include, for example, instructions and data 
which cause a general purpose computer, special purpose computer, or special purpose 
processing device to perform a certain function or group of functions. Computer- 
executable instructions also include program modules that are executed by computers in 
stand-alone or network environments. Generally, program modules include routines, 
programs, objects, components, and data structures, etc. that perform particular tasks or 
implement particular abstract data types. Computer-executable instructions, associated 
data structures, and program modules represent examples of the program code means
for executing steps of the methods disclosed herein. The particular sequence of such 
executable instructions or associated data structures represents examples of 
corresponding acts for implementing the functions described in such steps. 
[0032] Those of skill in the art will appreciate that other embodiments of the invention 
may be practiced in network computing environments with many types of computer 
system configurations, including personal computers, hand-held devices, multi-processor 
systems, microprocessor-based or programmable consumer electronics, network PCs, 
minicomputers, mainframe computers, and the like. Embodiments may also be practiced 
in distributed computing environments where tasks are performed by local and remote 
processing devices that are linked (either by hardwired links, wireless links, or by a 
combination thereof) through a communications network. In a distributed computing
environment, program modules may be located in both local and remote memory storage 
devices. 
[0033] Although the above description may contain specific details, they should not be
construed as limiting the claims in any way. Other configurations of the described 
embodiments of the invention are part of the scope of this invention. For example, the 
parameters of the TTS voices that may be used for interpolation in the process of 
blending voices may be any parameters, not just the LPC, LSF and other parameters
discussed above. Further, other synthetic voices, not just specific TTS voices, may be
developed that are represented by a type of segment parameter. Accordingly, the 
appended claims and their legal equivalents should only define the invention, rather than 
any specific examples given. 